-
Notifications
You must be signed in to change notification settings - Fork 478
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(core): sets default chunk_size and sends buffer > chunk_size directly #4710
Conversation
0eaaa22
to
f1ab650
Compare
f1ab650
to
0e38441
Compare
The docs CI for the main branch is broken. I guess the failure of this check is unrelated to this PR.
|
Thanks for you PR, I will take a review this weekend 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for your PR! This change affects our behavior of chunk
in write
process. Would you like to update the comments for chunk
too? Like:
opendal/core/src/types/operator/operator.rs
Lines 918 to 932 in 78692ce
/// ## `chunk` | |
/// | |
/// Set `chunk` for the writer. | |
/// | |
/// OpenDAL is designed to write files directly without chunking by default, giving users | |
/// control over the exact size of their writes and helping avoid unnecessary costs. | |
/// | |
/// This is not efficient for cases when users write small chunks of data. Some storage services | |
/// like `s3` could even return hard errors like `EntityTooSmall`. Besides, cloud storage services | |
/// will cost more money if we write data in small chunks. | |
/// | |
/// Set a good chunk size might improve the performance, reduce the API calls and save money. | |
/// | |
/// The following example will set the writer chunk to 8MiB. Only one API call will be sent at | |
/// `close` instead. |
pub struct ChunkedWriter<W: oio::Write> { | ||
inner: W, | ||
|
||
/// The size for buffer, we will flush the underlying storage at the size of this buffer. | ||
chunk_size: usize, | ||
/// If `exact` is true, the size of the data written to the underlying storage is | ||
/// exactly `chunk_size` bytes. | ||
exact: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice idea about introducing exact
.
@@ -49,6 +55,12 @@ impl<W: oio::Write> oio::Write for ChunkedWriter<W> { | |||
self.buffer.advance(written); | |||
} | |||
|
|||
if !self.exact && bs.len() >= self.chunk_size && self.buffer.is_empty() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can calculate if bs.len() + self.buffer.len() >= self.chunk_size
and write them at once. This could be helpful for cases like: write(1KiB)
, write(16MiB)
, write(1KiB)
.
Fixed in d6a0449 |
Some tests failed. I'll take a look. |
core/src/layers/complete.rs
Outdated
}); | ||
size | ||
}); | ||
let exact = capability.write_multi_align_size.is_some(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, I just noticed that exact
doesn't behave as I expected. If user has configured one or write_multi_align_size is some, we should use exact
. Some storage services requires to use the exactly the same chunk for uploading, like R2
Object part sizes must be at least 5MiB but no larger than 5GiB. All parts except the last one must be the same size.
self.buffer.advance(written); | ||
// We didn't sent `bs`. | ||
return Ok(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Write returning 0
often been treated as an error WriteZero. It's better not to do this.
We can push only 1B in the buffer.
This case is extremely unusual, and almost no other service behaves this way. Those rules are just a safety measure. We can make it clear in the comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about pushing written
bytes to the buffer queue? We already sent written
bytes so the queue should have "written" bytes space left.
I'll push a commit later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot!
What's changed
Currently,
CompleteLayer
sanitizes the chunk size when users callwith_chunk()
explicitly.https://docs.rs/opendal/latest/opendal/raw/struct.OpWrite.html#method.with_chunk
If users don't provide a chunk size, the writer won't guarantee the body size is larger than the minimum multi-part upload size. It's easy to forget to set the correct chunk size for the writer.
This PR changes this behavior:
write_multi_min_size
orwrite_multi_align_size
as the chunk size if users don't provide one.ChunkedWriter
attempts to send thebs
directly if it is larger than the chunk sizeexact
to theChunkedWriter
. Ifexact
is true, the writer still ensures the upload body size is exactly the chunk size.write_multi_align_size
will setexact
to true.chunk
size,exact
will also be true.There is a breaking change to
ChunkedWriter::new()
.Reference
GreptimeTeam/greptimedb#4098 (comment)